1.1. Introduction
Tree-based methods stratify or segment the predictor space into a number of simple regions1.
As the splitting rules used to make these decision regions can be summarised in a tree structure, these approaches are called decision trees.
A decision tree can be thought of as breaking data down by asking a series of questions in order to categorise samples into the same class.
NOTES
In the context of the different categories of machine learning algorithms that we defined at the beginning of this course, we may categorise decision trees as follows:
TODO
https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/06-trees__notes.pdf
Root node: no incoming edges, zero or more outgoing edges.
Internal node: one incoming edge, two (or more) outgoing edges.
Leaf node: each leaf node is assigned a class label if nodes are pure; otherwise, the class label is determined by majority vote.
Parent and child nodes: If a node is split, we refer to that given node as the parent node, and the resulting nodes are called child nodes.
Notes
Figure 1: Categorical Decision Tree
The "palmer penguins" dataset2 contains data for 344 penguins from 3 different species and from 3 islands in the Palmer Archipelago, Antarctica.
Artwork by @allison_horst
Figure 2: Penguin Buddies
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
An algorithm starts at a tree root and then splits the data based on the features that give the best split based on a splitting criterion.
Generally this splitting procedure occurs until3,4...
NOTES
TODO
Below is an example of a very shallow decision tree where we have set max_depth = 1.
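As a sketch of what this looks like in code (using scikit-learn's built-in iris data as a stand-in for the penguins dataset, which needs a separate download), a `max_depth = 1` tree is a single-split "decision stump":

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in data as a stand-in for the penguins dataset.
X, y = load_iris(return_X_y=True)

# max_depth=1 gives a "decision stump": one split, two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

print(export_text(stump))  # the single splitting rule
print("depth:", stump.get_depth())
```

With one split the tree has exactly one internal node (the root) and two leaf nodes.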
Terminology (Reminder)1
Extra: dtreeviz
Here's an additional visualisation package with extra features, such as being able to follow the path of a hypothetical test sample.
I don't use dtreeviz in the lectures, as it can be a bit of a hassle to set up. However, you may also find this a useful way of thinking about the splitting.
Notes
We can make the tree "deeper", and therefore more complex, by setting max_depth = 3.
We could also use more than 2 features as seen below.
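To illustrate the effect of depth (again with the iris data as a stand-in for the penguins), a deeper tree makes more splits, can use more features, and typically fits the data better:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
deeper = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The deeper tree carves the feature space into more regions.
print("max_depth=1 accuracy:", shallow.score(X_te, y_te))
print("max_depth=3 accuracy:", deeper.score(X_te, y_te))
```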
NOTES
We could also easily extend this to more than 2 (binary) class labels.
In a general sense this approach is pretty simple; however, there are a number of design choices and considerations we have to make, including5:
Most decision tree algorithms address the following implementation choices differently5:
There are a number of decision tree algorithms; prominent ones include:
Notes
Scikit-Learn uses an optimised version of the Classification And Regression Tree (CART) algorithm.
Notes
An algorithm starts at a tree root and then splits the data based on the feature, $f$, that gives the largest information gain, $IG$.
Splitting on information gain relies on calculating the difference between an impurity measure of the parent node, $D_p$, and the impurities of its child nodes, $D_j$; the information gain is high when the total impurity of the child nodes is low.
We can maximise the information gain at each split using,
$$IG(D_p,f) = I(D_p)-\sum^m_{j=1}\frac{N_j}{N_p}I(D_j),$$
where $I$ is our impurity measure, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$th child node.
Some algorithms, such as Scikit-learn's implementation of CART, reduce the potential search space by implementing binary trees:
$$IG(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left})-\frac{N_\text{right}}{N_p}I(D_\text{right}).$$
Three impurity measures that are commonly used in binary decision trees are the classification error ($I_E$), Gini impurity ($I_G$), and entropy ($I_H$)4.
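A minimal sketch of this computation (the function names here are mine, not from any library): the information gain of a binary split is the parent impurity minus the size-weighted child impurities.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=gini):
    """IG = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A perfect split sends each class to its own (pure) child node,
# so the gain equals the parent's impurity.
ig = information_gain(parent, parent[:4], parent[4:])
print(ig)  # 0.5
```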
This is simply the fraction of the training observations in a region that do not belong to the most common class:
$$I_E = 1 - \max\left\{p(i|t)\right\}$$
Here $p(i|t)$ is the proportion of the samples that belong to the $i$th of $c$ classes at node $t$.
Notes
For all non-empty classes ($p(i|t) \neq 0$), entropy is given by
$$I_H=-\sum^c_{i=1}p(i|t)\log_2p(i|t).$$
The entropy is therefore 0 if all samples at the node belong to the same class, and maximal if we have a uniform class distribution.
For example in binary classification ($c=2$):
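This binary case can be sketched numerically (the helper function is hypothetical, not a library call):

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a node where class 1 has proportion p."""
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]   # skip empty classes (0 log 0 := 0)
    if len(probs) < 2:
        return 0.0             # pure node
    return -np.sum(probs * np.log2(probs))

print(binary_entropy(0.5))  # 1.0: uniform class distribution, maximal
print(binary_entropy(1.0))  # 0.0: all samples in one class
```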
Notes
Gini impurity is an alternative measure, which minimises the probability of misclassification,
$$ \begin{align} I_G(t) &= \sum^c_{i=1}p(i|t)(1-p(i|t)) \\ &= 1-\sum^c_{i=1}p(i|t)^2. \end{align} $$
This measure is also maximal when the classes are perfectly mixed (e.g. $c=2$):
$$ \begin{align} I_G(t) &= 1 - \sum^c_{i=1}0.5^2 = 0.5. \end{align} $$
Notes
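Evaluating all three impurity measures at a perfectly mixed binary node ($p = 0.5$) confirms that each is maximal there:

```python
import numpy as np

p = 0.5  # proportion of class 1 at the node (binary classification)

error = 1 - max(p, 1 - p)                           # classification error
gini = 1 - (p ** 2 + (1 - p) ** 2)                  # Gini impurity
ent = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # entropy (bits)

print(error, gini, ent)  # 0.5 0.5 1.0
```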
Classification error is rarely used for information gain in practice.
This is because tree growth can get stuck: the error may not improve for any available split even when a better tree exists. This cannot happen with a strictly concave impurity function such as entropy or Gini impurity.
Notes
Figure 12: Child Node Averages
Decision trees allow us to assess the importance of each feature for classifying the data.
The importance of a feature is the normalized total reduction of the criterion (e.g. Gini) brought by that feature10.
[INSERT ALGEBRA]
bill_length_mm: 0.015
bill_depth_mm: 0.030
flipper_length_mm: 0.955
body_mass_g: 0.000
total: 1.0
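The importances above come from the fitted tree's feature_importances_ attribute. A sketch of how you would obtain them (using the iris data as a stand-in, so the feature names and values differ from the penguin output above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(data.data, data.target)

# Normalised total impurity reduction contributed by each feature.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
print("total:", clf.feature_importances_.sum())  # importances sum to 1
```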
Question: When do we stop growing a tree?
Occam’s razor: Favor a simpler hypothesis, because a simpler hypothesis that fits the data equally well is more likely or plausible than a complex one5.
To minimize overfitting, we can either set limits on the trees before building them (pre-pruning), or reduce the tree by removing branches that do not significantly contribute (post-pruning).
NOTES
load_breast_cancer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns): the mean, error, and worst values of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. All 30 columns are 569 non-null float64.
dtypes: float64(30)
memory usage: 133.5 KB
Extra
You can explore the data below, although I recommend limiting the number of features to plot for ease.
An a priori limit on nodes, or tree depth, is often set to avoid overfitting due to a deep tree4,5.
Notes
Reduced-error pruning is not implemented in sklearn, although I'm sure you can find some code for it somewhere on the internet. TODO
We could also set a minimum number of data points for each node5.
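A sketch of pre-pruning on the breast-cancer data loaded above, limiting both the tree depth and the minimum leaf size a priori:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unrestricted tree: grown until the leaves are (near) pure.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pre-pruned tree: depth and minimum node size limited a priori.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                random_state=0).fit(X_tr, y_tr)

print("full   train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

The unrestricted tree typically reaches perfect training accuracy, a sign of overfitting; the pre-pruned tree trades a little training accuracy for simpler decision regions.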
In general, post-pruning consists of going back through the tree once it has been created, removing branches that do not significantly contribute to error reduction, and replacing them with leaf nodes6.
Two common approaches are reduced-error pruning and cost-complexity pruning.
Using Scikit-learn, we can recursively fit a complex tree with no prior pruning and have a look at the effective alphas and the corresponding total leaf impurities at each step of the pruning process.
As alpha increases, more of the tree is pruned, which (up to a point) yields a decision tree that generalizes better.
We can select the alpha that reduces the distance between the train and validation scores.
Notes
In Scikit-learn 0.22 the parameter ccp_alpha (short for Cost Complexity Pruning Alpha) was introduced. DecisionTreeClassifier.cost_complexity_pruning_path returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. Then we can train a decision tree using the chosen effective alpha.
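A sketch of that workflow on the breast-cancer data (the choice of which alpha to refit with is illustrative; in practice you would pick it by comparing train and validation scores):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas and total leaf impurities for each pruning step.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
print("number of candidate alphas:", len(path.ccp_alphas))

# Refit with one of the larger candidate alphas: larger alpha -> smaller tree.
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
clf.fit(X_tr, y_tr)
print("leaves after pruning:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_te, y_te))
```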
ID3 - Iterative Dichotomizer 38
C4.59
Notes
Now might be a good time to try exercises X-X.